Is it possible to have an indexed view in MySQL? - sql

I found a posting on the MySQL forums from 2005, but nothing more recent than that. Based on that, it's not possible. But a lot can change in 3-4 years.
What I'm looking for is a way to have an index over a view but have the table that is viewed remain unindexed. Indexing hurts the writing process and this table is written to quite frequently (to the point where indexing slows everything to a crawl). However, this lack of an index makes my queries painfully slow.

I don't think MySQL supports materialized views which is what you would need, but it wouldn't help you in this situation anyway. Whether the index is on the view or on the underlying table, it would need to be written and updated at some point during an update of the underlying table, so it would still cause the write speed issues.
Your best bet would probably be to create summary tables that get updated periodically.

Have you considered abstracting your transaction processing data from your analytical processing data so that they can both be specialized to meet their unique requirements?
The basic idea being that you have one version of the data that is regularly modified, this would be the transaction processing side and requires heavy normalization and light indexes so that write operations are fast. A second version of the data is structured for analytical processing and tends to be less normalized and more heavily indexed for fast reporting operations.
Data structured around analytical processing is generally built around the cube methodology of data warehousing, being composed of fact tables that represent the sides of the cube and dimension tables that represent the edges of the cube.

Flexviews supports materialized views in MySQL by tracking changes to underlying tables and updating the table which functions as a materialized view. This approach means that SQL supported by the view is a bit restricted (as the change logging routines have to figure out which tables it should track for changes), but as far as I know this is the closest you can get to materialized views in MySQL.

Do you only want one indexed view? It's unlikely that writing to a table with only one index would be that disruptive. Is there no primary key?
If each record is large, you might improve performance by figuring out how to shorten it. Or shorten the length of the index you need.
If this is a write-only table (i.e. you don't need to do updates), it can be deadly in MySQL to start archiving it, or otherwise deleting records (and index keys), requiring the index to start filling (reusing) slots from deleted keys, rather than just appending new index values. Counterintuitive, but you're better off with a larger table in this case.

Related

Is it a good idea to create tables dynamically to store user-content?

I'm currently designing an application where users can create/join groups, and then post content within a group. I'm trying to figure out how best to store this content in a RDBMS.
Option 1: Create a single table for all user content. One of the columns in this table will be the groupID, designating which group the content was posted in. Create an index using the groupID, to enable fast searching of content within a specific group. All content reads/writes will hit this single table.
Option 2: Whenever a user creates a new group, we dynamically create a new table. Something like group_content_{groupName}. All content reads/writes will be routed to the group-specific dynamically created table.
Pros for Option 1:
It's easier to search for content across multiple groups, using a single simple query, operating on a single table.
It's easier to construct simple cross-table queries, since the content table is static and well-defined.
It's easier to implement schema changes and changes to indexing/triggers etc, since there's only one table to maintain.
Pros for Option 2:
All reads and writes will be distributed across numerous tables, thus avoiding any bottlenecks that can result from a lot of traffic hitting a single table (though admittedly, all these tables are still in a single DB)
Each table will be much smaller in size, allowing for faster lookups, faster schema-changes, faster indexing, etc
If we want to shard the DB in future, the transition would be easier if all the data is already "sharded" across different tables.
What are the general recommendations between the above 2 options, from performance/development/maintenance perspectives?
One of the cardinal sins in computing is optimizing too early. It is the opinion of this DBA of 20+ years that you're overestimating the IO that's going to happen to these groups.. RDBMS's are very good at querying and writing this type of info within a standard set of tables. Worst case, you can partition them later. You'll have a lot more search capability and management ease with 1 set of tables instead of a set per user.
Imagine if the schema needs to change? do you really want to update hundreds or thousands of tables or write some long script to fix a mundane issue? Stick with a single set of tables and ignore sharding. Instead, think "maybe we'll partition the tables someday, if necessary"
It is a no-brainer. (1) is the way to go.
You list these as optimizations for the second method. All of these are misconceptions. See comments below:
All reads and writes will be distributed across numerous tables, thus
avoiding any bottlenecks that can result from a lot of traffic hitting
a single table (though admittedly, all these tables are still in a
single DB)
Reads and writes can just as easily be distributed within a table. The only issue would be write conflicts within a page. That is probably a pretty minor consideration, unless you are dealing with more than dozens of transactions per second.
Because of the next item (partially filled pages), you are actually much better off with a single table and pages that are mostly filled.
Each table will be much smaller in size, allowing for faster lookups,
faster schema-changes, faster indexing, etc
Smaller tables can be a performance disaster. Tables are stored on data pages. Each table is then a partially filled page. What you end up with is:
A lot of wasted space on disk.
A lot of wasted space in your page cache -- space that could be used to store records.
A lot of wasted I/O reading in partially filled pages.
If we want to shard the DB in future, the transition would be easier
if all the data is already "sharded" across different tables.
Postgres supports table partitioning, so you can store different parts of a table in different places. That should be sufficient for your purpose of spreading the I/O load.
Option 1: Performance=Normal Development=Easy Maintenance=Easy
Option 2: Performance=Fast Development=Complex Maintenance=Hard
I suggest to choose the Oprion1 and for the BIG table you can manage the performance with better indexes or cash indexes (for some DB) and the last thing is nothing help make the second Option 2, because development a maintenance time is fatal factor

SQL Database Best Practices - Use of Archive tables?

I'm not a trained DBA, but perform some SQL tasks and have this question:
In SQL databases I've noticed the use archive tables that mimic another table with the exact same fields and which are used to accept rows from the original table when that data is deemed for archiving. Since I've seen examples where those tables reside in the same database and on the same drive, my assumption is that this was done to increase performance. Such tables didn't have more than a about 10 million rows in them...
Why would this be done instead of using a column to designate the status of the row, such as a boolean for an in/active flag?
At what point would this improve performance ?
What would be the best pattern to structure this correctly, given that the data may still need to be queried (or unioned with current data) ?
What else is there to say about this ?
The notion of archiving is a physical, not logical, one. Logically the archive table contains the exact same entity and ought to be the same table.
Physical concerns tend to be pragmatic. The overarching notion is that the "database is getting too (big/slow"). Archiving records makes it easier to do things like:
Optimize the index structure differently. Archive tables can have more indexes without affecting insert/update performance on the working table. In addition, the indexes can be rebuilt with full pages, while the working table will generally want to have pages that are 50% full and balanced.
Optimize storage media differently. You can put the archive table on slower/less expensive disk drives that maybe have more capacity.
Optimize backup strategies differently. Working tables may require hot backups or log shipping while archive tables can use snapshots.
Optimize replication differently, if you are using it. If an archive table is only updated once per day via nightly batch, you can use snapshot as opposed to transactional replication.
Different levels of access. Perhaps you want different security access levels for the archive table.
Lock contention. If you working table is very hot you'd rather have your MIS developers access the archive table where they are less likely to halt your operations when they run something and forget to specify dirty read semantics.
The best practice would not to use archive tables but to move the data from the OLTP database to an MIS database, data warehouse, or data marts with denormalized data. But some organizations will have trouble justifying the cost of an additional DB system (which aren't cheap). There are far fewer hurdles to adding an additional table to an existing DB.
I say this frequently, but...
Multiple tables of identical structure almost never makes sense.
A status flag is a much better idea. There are proper ways to increase performance (partitioning/indexing) without denormalizing data or otherwise creating redundancies. 10 million records is pretty small in the world of modern rdbms, so what you're seeing is the product of poor planning or misunderstanding of databases.

Rule of thumb when to partition tables

I've seen many entries about partitioning tables, but there is not a lot of information on when you should make a partition.
Is there a rule of thumb when you should partition tables in SQL Server.
Thanks
My benchmarks indicate that it depends on the query load.
If the queries you perform ALWAYS contain a filter on the partition field the performance benefit is virtually instant (like 1000 records in the table is already beneficial)
If the queries do NOT always contain a filter on the partition field you really have to benchmark with a good sample of the query load before making the decision.
You also have to account for the partition system you use. if you use "static" partitions there is not much harm in creating them immediately. When you use a "sliding window" system you need to take into account the overhead of creating and merging partitions. (which can take a long time on big tables)
#Filip's post is a great topical guide. When your doing your ontologies and estimating how your application will be used, that is, how your users will interact with the application and how that translates to database access, you should have a good idea of the kind of queries that will be performed and how fast certain tables will grow. If your that confident, then you should immediately partition the tables to defer from any maintenance hapzards.
But if your trying to decide on whether to partition populated tables, or you like to perform partitioning lazily like me, here's a nice little nugget from the PostgreSQL docs:
The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server. [src]

Postgresql Application Insertion and Trigger Performance

I'm working on designing an application with a SQL backend (Postgresql) and I've got a some design questions. In short, the DB will serve to store network events as they occur on the fly, so insertion speed and performance is critical due 'real-time' actions depending on these events. The data is dumped into a speedy default format across a few tables, and I am currently using postgresql triggers to put this data into some other tables used for reporting.
On a typical event, data is inserted into two different tables each share the same primary key (an event ID). I then need to move and rearrange the data into some different tables that are used by a web-based reporting interface. My primary goal/concern is to keep the load off the initial insertion tables, so they can do their thing. Reporting is secondary, but it would still be nice for this to occur on the fly via triggers as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance wise..does this make sense or am I totally off?
Once the data is in the appropriate reporting tables, I won't need to hang on to the data in the insertion tables too long, so I'll keep those regularly pruned for insertion performance. In thinking about this scenario, which I'm sure is semi-common, I've come up with three options:
Use triggers to trigger on the initial row insert and populate the reporting tables. This was my original plan.
Use triggers to copy the insertion data to a temporary table (same format), and then trigger or cron to populate the reporting tables. This was just a thought, but I figure that a simple copy operation to a temp table will offload any of the query-ing of the triggers in the solution above.
Modify my initial output program to dump all the data to a single table (vs across two) and then trigger on that insert to populate the reporting tables. So where solution 1 is a multi-table to multi-table trigger situation, this would be a single-table source to multi-table trigger.
Am I over thinking this? I want to get this right. Any input is much appreciated!
You may experience have a slight increase in performance since there are more "things" to do (although they should not affect operations in any way). But using Triggers/other PL is a good way to reduce it to minimum subce they are executed faster than code that gets sent from your application to the DB-Server.
I would go with your first idea 1) since it seems to me the cleanest and most efficient way.
2) is the most performance hungry solution since cron will do more queries than the other solutions that use server-side functions. 3) would be possible but will resulst in an "uglier" database layout.
This is an old one but adding my answer here.
Reporting is secondary, but it would still be nice for this to occur on the fly via triggers as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance wise..does this make sense or am I totally off?
That may be way off, I'm afraid, but under a few cases it may not be. It depends on the effects of caching on the reports. Keep in mind that disk I/O and memory are your commodities, and that writers and readers rarely block eachother on PostgreSQL (unless they explicitly elevate locks--- a SELECT ... FOR UPDATE will block writers for example). Basically if your tables fit comfortably in RAM you are better off reporting from them since you are keeping disk I/O free for the WAL segment commit of your event entry. If they don't fit in RAM then you may have cache miss issues induced by reporting. Here materializing your views (i.e. making trigger-maintained tables) may cut down on these but they have a significant complexity cost. This, btw, if your option 1. So I would chalk this one up provisionally as premature optimization. Also keep in mind you may induce cache misses and lock contention on materializing the views this way, so you might induce performance problems regarding inserts this way.
Keep in mind if you can operate from RAM with the exception of WAL commits, you will have no performance problems.
For #2. If you mean temporary tables as CREATE TEMPORARY TABLE, that's asking for a mess including performance issues and reports not showing what you want them to show. Don't do it. If you do this, you might:
Force PostgreSQL to replan your trigger on every insert (or at least once per session). Ouch.
Add overhead creating/dropping tables
Possibilities of OID wraparound
etc.....
In short I think you are overthinking it. You can get very far by bumping RAM up on your Pg box and making sure you have enough cores to handle the appropriate number of inserting sessions plus the reporting one. If you plan your hardware right, none of this should be a problem.

What is the best way to partition large tables in SQL Server?

In a recent project the "lead" developer designed a database schema where "larger" tables would be split across two separate databases with a view on the main database which would union the two separate database-tables together. The main database is what the application was driven off of so these tables looked and felt like ordinary tables (except some quirky things around updating). This seemed like a HUGE performance problem. We do see problems with performance around these tables but nothing to make him change his mind about his design. Just wondering what is the best way to do this, or if it is even worth doing?
I don't think that you are really going to gain anything by partitioning the table across multiple databases in a single server. All you have essentially done there is increased the overhead in working with the "table" in the first place by having several instances (i.e. open in two different DBs) of it under a single SQL Server instance.
How large of a dataset do you have? I have a client with a 6 million row table in SQL Server that contains 2 years worth of sales data. They use it transactionally and for reporting without any noticiable speed problems.
Tuning the indexes and choosing the correct clustered index is crucial to performance of course.
If your dataset is really large and you are looking to partition, you will get more bang for your buck partitioning the table across physical servers.
Partitioning is not something to be undertaken lightly as there can be many subtle performance implications.
My first question is are you referring simply to placing larger table objects in separate filegroups (on separate spindles) or are you referring to data partitioning inside of a table object?
I suspect that the situation described is an attempt to have the physical storage of certain large tables on different spindles from the rest of the tables. In this case, adding the extra overhead of separate databases, losing any ability to enforce referential integrity across databases, and the security implications of enabling cross-database ownership chaining does not provide any benefit over using multiple filegroups within a single database. If, as is quite possible, the separate databases you refer to in your question are not even stored on separate spindles but are all stored on the same spindle then you negate even the slight performance benefit you could have gained by physically separating your disk activity and have received absolutely no benefit.
I would suggest instead of using additional databases to hold large tables you look into the Filegroup topic in SQL Server Books Online or for a quick review see this article:
If you are interested in data partitioning (including partitioning into multiple file groups) then I recommend reading articles by Kimberly Tripp, who gave an excellent presentation at the time SQL Server 2005 came out about the improvements available there. A good place to start is this whitepaper
Which version of SQL Server are you using? SQL Server 2005 has partitioned tables, but in 2000 (or 7.0) you needed to use partition views.
Also, what was the reasoning for putting the table partitions in a separate database?
When I've had to partition tables in the past (pre-2005), it's usually by a date column or something similar, with a view over the various partitions. Books Online has a section that talks about how to do this and all of the rules around it. You need to follow the rules to make it work how it's supposed to work.
The key thing to remember is that your partitioning column must be part of the primary key and you want to try to always use that column in any access against the table so that the optimizer can ignore partitions that shouldn't be affected by the query.
Look up "partitioned table" in MSDN and you should be able to find a more complete tutorial for SQL Server 2005 partitioned tables as well as advice on how to set them up for maximum performance.
Are you asking about best practices in terms of database design, or convincing your lead to change his mind? :)
In terms of design... Back in the goode olde days, vertical partitioning was sometimes needed to work around database engine limitations, where the number of columns in a table was a hard limit, like 255 columns. These days the main benefits are purely for performance: putting rarely used columns, or blobs on a separate disk array. But if you're regularly pulling things from both tables it will likely be a loss. It sounds like your lead is suffering from a case of premature optimisation.
In terms of telling your lead is wrong... that requires diplomacy. If he's aware of mutterings of discontent in terms of performance, a benchmark is probably the best way to show the difference.
Create a new physical table somewhere with 'create table t1 as select * from view1' and then run some lengthy batch with the vertically partitioned table and your new table. If it's as bad as you say, the difference should be evident.
But this too may be premature optimisation. Find out what the end-users think of the performance. If the performance is good enough, for some definition of good, then don't fix what ain't broke.
There is a definite benefit for table partitioning (regardless whether it's on same or different filegroups /disks). If the partition column is correctly selected, you'll realize that your queries will hit only the required partition. So imagine if you have 100 million records (I've partitioned tables much bigger than that - about 20+ Billion rows) and if for the most part, more than 70% of your data access is only a certain category or timeline or type of data then it helps to keep the most accessed data in a separate partition. Plus you can align the partition with separate file groups with various type of disks (SATA, Fiber channel, SSDs) so that the most accessed/busy data are on the fastest storage and the least/rarely accessed are virtually on slower disks.
Although, in SQL Server there's limited partitioning ability, unlike Oracle. You can choose only one column for partitioning (even in SQL 2008). So you've to choose a column wisely where that column also is part of most of your frequent queries. For the most part, people find it easy to choose to partition by a date column. However although it seems logical to partition that way, if your queries do not have that column as part of the condition, you won't be gaining sufficient benefits from partitioning (in other words, your query will hit all the partition regardless).
It's much easier to partition for data warehouse/data mining type databases than OLTP as most DW database queries are limited by time period.
That's why these days due to the volume of data being handled by databases, it's wise to design the application in such a way that ever query is limited by some broader group such as time, geographical location or such so that when such columns are chosen for partitioning you'll gain maximum benefits.
I would disagree with the assumption that nothing can be gained by partitioning.
If the partition data is physically and logically aligned, then the potential IO of queries should be dramatically reduced.
For example, We have a table which has the batch field as an INT representing an INT.
If we partition the data by this field and then re-run a query for a particular batch, we should be able to run set statistics io ON before and after partitioning and see a reduction in IO,
If we have a million rows per partition and each partition is written to a separate device. The query should be able to eliminate the nonessential partitions.
I've not done a lot of partitioning on SQL Server, but I do have experience of partitioning on Sybase ASE, and this is known as partition eliminiation. When I have time I'm going to test out the scenario on a SQL Server 2005 machine.