I have a MySQL (InnoDB) table 'items' with the following characteristics:
A large number of rows, which keeps increasing.
A large number of columns of various data types, including 'text'.
A primary key, 'item_id', is present.
There are additional requirements as follows:
Need to query items based on their status
Need to update status
The above two operations happen quite frequently.
Given the above scenario, I have two questions:
Would making a separate table with two columns, namely item_id and status, with item_id as the primary key, provide increased performance?
If so, how am I going to tackle querying item_ids based on status?
I am inexperienced in handling databases. I hope you will bear with me :)
This is called vertical segmentation. It is often used when a data entity has multiple access patterns which access different subsets of the entity's attributes (table columns), with different frequencies. If one function needs access to only one or two columns hundreds of times per second, and another application function needs access to all the other columns, but only once or twice a day, then this approach is warranted and will garner a substantial performance improvement.
Basically, as you suggested, you "split" the table into two tables, both with the same key, with a one-to-one FK/PK->PK relationship. In one table you put only those few columns that are accessed more frequently, and you put the rest of the columns in the other table that will be accessed less frequently. You can then apply indexing to each table more appropriately based on the actual access pattern for each table separately.
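For the scenario in the question, a minimal sketch of that split might look like the following; the item_status name, the column types, and the status values are assumptions, not taken from the question:

-- Narrow, frequently accessed table: only the key and the hot column.
-- Names and types here are illustrative assumptions.
CREATE TABLE item_status (
    item_id BIGINT      NOT NULL PRIMARY KEY,
    status  VARCHAR(20) NOT NULL,
    CONSTRAINT fk_item_status_item
        FOREIGN KEY (item_id) REFERENCES items (item_id)
) ENGINE=InnoDB;

-- Index to support "find items by status" queries.
CREATE INDEX idx_item_status_status ON item_status (status);

-- The frequent operations touch only the narrow table:
UPDATE item_status SET status = 'shipped' WHERE item_id = 42;
SELECT item_id FROM item_status WHERE status = 'pending';

-- The wide table is joined in only when the full item is actually needed:
SELECT i.*, s.status
FROM items AS i
JOIN item_status AS s ON s.item_id = i.item_id
WHERE s.status = 'pending';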
It would make more sense to create an index on status, if item_id and status are the only columns you need to fetch.
create index status_item_id_items on items (status)
You can then query your result that will use this index:
select item_id, status from items where status = 'status'
Keep in mind that if you don't have many different statuses, your query may end up returning a lot of rows and could be slow. If you can also constrain the query by a more 'selective' column, like a datetime, it would be better.
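As a hedged sketch of that idea, assuming a hypothetical created_at datetime column on items (in InnoDB a secondary index also carries the primary key, so item_id can be read from the index alone):

-- Hypothetical: assumes items has a created_at column; 'pending' is an example value.
CREATE INDEX idx_items_status_created ON items (status, created_at);

-- The range condition on created_at narrows the index scan, and the wide rows
-- are never touched because item_id comes from the secondary index itself.
SELECT item_id
FROM items
WHERE status = 'pending'
  AND created_at >= '2024-01-01';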
Answering part 2 first, you'd do an inner join of your two tables:
SELECT i.*, s.StatusCode FROM items AS i INNER JOIN status AS s ON s.item_id = i.item_id
To answer part 1, though, I don't think doing this would gain you any performance advantage.
I have 2 tables from which I'm joining certain columns. They are joined on a VARCHAR column (indexed in both tables). Table A has a bit over 800,000 records and Table B has 20,000 records.
Table A has an auto_inc primary key. Table B does not have a primary key, only the index on the mentioned VARCHAR column.
The query takes about 48 seconds which is too slow. What can I do to increase the speed? Would it help to create a primary key auto_incr in table B? Even if this is not the column on which the join takes place?
I'm a beginning user of SQL. Both tables are InnoDB and I use MariaDB.
QUERY:
select distinct
`pr`.`ProductIdentifier` AS `ProductIdentifier`,
`pr`.`Datum` AS `Datum`,
`pr`.`Retailer` AS `Retailer`,
`pr`.`Prijs` AS `Prijs`,
`pm`.`Merk` AS `Merk`,
`pm`.`Product` AS `Product`,
`pm`.`Formaat` AS `Formaat`
from
(`prices`.`prices_table` `pr`
join `prices`.`product_match_table` `pm`
on(`pr`.`ProductIdentifier` = `pm`.`ProductIdentifier`))
EXPLAIN SELECT:
(The EXPLAIN output was posted as an image.)
This answer is based on my knowledge of indexing in general; MariaDB may have some more specialised options I am not aware of.
However, indexes broadly speed up queries in two ways:
By only having the columns needed, meaning less data to read and process
By being sorted in an appropriate manner to help processing
For the first, you typically need a covering index.
For the second, this includes:
Being sorted the same way (e.g., indexed on the same fields) as tables it is being JOINed to in the query
Being sorted so that WHERE clauses and other types of filtering can directly use the sort to go to the appropriate spot in the index/table
In practice, the best performance improvement often comes from that last one; however, you do not have any WHERE clauses in the query shown. If (as is typical) the users filter the results (e.g., only show me results where ProductName = 'Handbag'), then you may need to adjust the indexes for those filters (more on that a bit later though).
Covering indexes for the query above
I think with the current query (and no filtering etc.) the fastest you can get is with two indexes:
CREATE INDEX `IX_prices_ProductIdentifier` ON `prices`.`prices_table`
(`ProductIdentifier`,
`Datum`,
`Retailer`,
`Prijs`);
CREATE INDEX `IX_productmatch_ProductIdentifier` ON `prices`.`product_match_table`
(`ProductIdentifier`,
`Merk`,
`Product`,
`Formaat`);
These provide covering indexes for the query shown, and both are sorted the same way (by ProductIdentifier) to make the join easier.
Searching/filtering (not specified in initial example)
However, if people often search by a specific field first, then it makes sense to re-order the fields in the relevant index (so the searched field is first), or to have multiple indexes with the search field at the front.
For example, people may be able to search for specific values in pr.Retailer, pm.Merk, or pm.Product. You may therefore add these additional indexes
CREATE INDEX `IX_prices_Retailer` ON `prices`.`prices_table`
(`Retailer`,
`ProductIdentifier`,
`Datum`,
`Prijs`);
CREATE INDEX `IX_productmatch_Merk` ON `prices`.`product_match_table`
(`Merk`,
`ProductIdentifier`,
`Product`,
`Formaat`);
CREATE INDEX `IX_productmatch_Product` ON `prices`.`product_match_table`
(`Product`,
`ProductIdentifier`,
`Merk`,
`Formaat`);
Notice with the above that the field order matters. The data (index) is sorted by the first field, then the second field, then the third field, etc. To use the index effectively, your filtering/WHERE clause needs to include at least the first field, if not more.
An alternative to these indexes (the ones for filtering) is to keep the original two indexes as above, but then put a separate single-column index on each of the fields users can search on. For example, if the users can filter on the retailer, merk and product, then create (as sketched after this list):
one index on pr.Retailer
one on pm.Merk, and
one on pm.Product
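For instance, that simpler set of indexes might look like this (the index names are just illustrative):

CREATE INDEX `IX_prices_Retailer_only` ON `prices`.`prices_table` (`Retailer`);
CREATE INDEX `IX_productmatch_Merk_only` ON `prices`.`product_match_table` (`Merk`);
CREATE INDEX `IX_productmatch_Product_only` ON `prices`.`product_match_table` (`Product`);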
Caveats
Adding indexes makes inserts into the relevant table (and often deletes/updates) slower than if the indexes weren't there. The reason is that the database doesn't just need to update the data in the table; it also needs to update the index(es).
Typically this is not much of a problem unless you are adding and deleting lots of data from the tables frequently. But it is worth checking your 'product maintenance' interface (e.g., adding products, updating prices etc) after adding indexes to confirm they still run well.
I have multiple tables that data can be queried from with joins.
In regards to database performance:
Should I run multiple selects from multiple tables for the required data?
or
Should I write 1 select that uses a bunch of Joins to select the required data from all the tables at once?
EDIT:
The where clause I will be using for the select contains Indexed fields of the tables. It sounds like because of this, it will be faster to use 1 select statement with many joins. I will however still test the performance difference between the 2.
Thanks for all the great answers.
Just write one query with joins. If you are concerned about performance there are a number of options including:
Creating indexes that will help the performance of your selects
Creating a persisted, denormalized form of the data you want so you can query one table. This would most likely be an indexed view or another table (a sketch follows below).
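As a hedged sketch of that second option, using illustrative table names that are not from the question: a persisted, denormalized table is built from the joined source tables and indexed, so reads hit a single table. It has to be refreshed, or maintained by triggers/application code, whenever the source tables change.

-- Illustrative only: hypothetical orders/customers/order_items tables.
CREATE TABLE order_summary AS
SELECT  o.order_id,
        o.order_date,
        c.customer_name,
        SUM(oi.quantity * oi.unit_price) AS order_total
FROM    orders o
JOIN    customers   c  ON c.customer_id = o.customer_id
JOIN    order_items oi ON oi.order_id   = o.order_id
GROUP BY o.order_id, o.order_date, c.customer_name;

-- Index the denormalized table for the reads you care about.
CREATE INDEX idx_order_summary_customer ON order_summary (customer_name);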
This can be one of those well-gee-it-depends answers, but generally, if you're writing straight SQL, do one query, especially since the joins might limit some of the data you get back.
There is a good chance that if you do multiple point queries for one record in each table, using the table's primary key for the lookup, the connection cost for each query will be more expensive than the actual query.
It depends on how the tables are joined. If you end up with a cross-product of all the tables, then it would be better to do individual selects. However, if your tables are properly indexed and well thought out, one query with multiple joins will be more efficient.
If you have proper indexes on your tables, you might be better off with the JOINs, but they are often the cause of bottlenecks. Instead of multiple selects, you might look at ways to denormalize your data. It is far less "expensive" to update a count or timestamp in multiple tables when a user performs an operation, which saves you from having to join those tables later.
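As a hedged illustration of that kind of denormalization, using a hypothetical forum schema (the threads/posts tables and counter columns are assumptions): the counter is maintained at write time, so later reads need neither the join nor the aggregate.

-- Hypothetical schema: threads(thread_id, reply_count, last_reply_at), posts(thread_id, body).
-- When a reply is added, maintain the denormalized columns in the same transaction:
INSERT INTO posts (thread_id, body) VALUES (42, 'example reply');

UPDATE threads
SET    reply_count   = reply_count + 1,
       last_reply_at = NOW()
WHERE  thread_id = 42;

-- Listing threads with their reply counts now needs no join or COUNT():
SELECT thread_id, reply_count, last_reply_at FROM threads;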
The best tool I find for performance tuning of queries is EXPLAIN. You type EXPLAIN before the query and you can see how many rows are scanned. Your goal is to get that number as low as possible, which means your indexes are being used. The other thing is that when creating indexes, use compound indexes on multiple fields and order them left to right in the order they appear in the WHERE clause.
For example, say you have 10,000 rows in sometable:
SELECT id, name, description, status FROM sometable WHERE name LIKE '%someName%' AND status = 'Active';
You could type EXPLAIN before the query and it might return 10,000 as the number of rows scanned to find matches. You then create a compound index:
ALTER TABLE sometable ADD INDEX idx_st_search (name, status);
You then run EXPLAIN on the query again, and it might return 1 as the number of rows scanned, with performance significantly improved.
It depends on your table design.
Most of the time one large query is better, but be sure to:
Use primary keys in the WHERE clause as much as you can for joins.
Use indexed fields, or create indexes for the fields used in WHERE clauses.
I was wondering if anyone ever had a chance to measure how 100 joined tables would perform?
Each table would have an ID column with a primary index, and all tables are 1:1 related.
It is a common problem within many data-entry applications where we need to collect 1000+ data points. One solution would be to have one big table with 1000+ columns; the alternative would be to split them into multiple tables and join them when necessary.
So perhaps the more realistic question is how 30 tables (30 columns each) would behave with a multi-table join.
500K-1M rows should be the expected size of the tables.
Cheers
As a rule of thumb, any more than 25 joins might be a performance problem. I try to keep joins below 10-15. It depends on the database activity, the number of concurrent users, and the ratio of reads to writes.
Suggest you look at indexed views.
With any well-tuned database, 'good' indexes for the query workload are the key.
They'd most likely perform terribly, unless you had a very small number of rows per table.
Go for a wider table, but normalize it properly. My guess is that if you normalize your data properly, you will have a slightly more sane design.
What you describe is similar to the implementation of a column-oriented database (Wikipedia). The data is stored in "column-major" format, which slows down adding each row but is much faster for queries whose WHERE clause restricts the returned row set.
Why is it that you would rather split the rows? Is it that you measure the data elements for each row at different times? Or is it that the query result of a row would be very large?
Since first posting this, you answered me below that your reason for desiring a split of the table is that you usually only work with a subset of the data.
In that case, splitting the table can help your performance (the amount of runtime consumed by the query) to some extent. This may be an important factor in your wanting to work with less data, in the case where your database engine runs slowly with large rows.
If performance is not a concern, rather than using SQL JOINs, it might serve you to explicitly list the columns you wish to retrieve in each query. For example, if you only wish to retrieve width, height, and length for a row, you could use:
SELECT width, height, length FROM datatable; rather than SELECT * FROM datatable; and accomplish the same improvement of returning less data. The SQL statements used would probably also be shorter than the alternative join statements we were considering.
There's no way to better organise the tables? For example a "DataPointTypes" and "DataPointValues" table?
For example (and I don't know your particular circumstances) if all of your tables are like "WebsiteDataPoints (WebsitePage, Day, Visits)" "StoreDataPoints (Branch, Week, Sales)" etc. you could instead have
DataPointSources(Name)
(with data: Website,Store)
DataPointTypes(SourceId, ColumnName)
(with data: (Website, WebsitePage), (Website, Day), (Store, Branch), (Store, Sales) etc.)
DataPointEntry(Id, Timestamp)
DataPointValues (EntryId, TypeId, Value (as varchar probably))
(with data: (1, Website-WebsitePage, 'pages.php'), (2, Store-Branch, 'MainStore'), (1, Website-Day, '12/03/1980'), (2, Store-Sales '35') etc.)
In this way each table becomes a source, each column becomes a type, each row becomes an entry, and each cell becomes a value.
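A rough sketch of those tables, where only the table and column names come from the outline above and the types and keys are assumptions:

-- Assumed types/keys; names follow the outline above.
CREATE TABLE DataPointSources (
    Id   INT AUTO_INCREMENT PRIMARY KEY,
    Name VARCHAR(100) NOT NULL                 -- e.g. 'Website', 'Store'
);

CREATE TABLE DataPointTypes (
    Id         INT AUTO_INCREMENT PRIMARY KEY,
    SourceId   INT NOT NULL,
    ColumnName VARCHAR(100) NOT NULL,          -- e.g. 'WebsitePage', 'Day', 'Branch'
    FOREIGN KEY (SourceId) REFERENCES DataPointSources (Id)
);

CREATE TABLE DataPointEntry (
    Id          INT AUTO_INCREMENT PRIMARY KEY,
    `Timestamp` DATETIME NOT NULL
);

CREATE TABLE DataPointValues (
    EntryId INT NOT NULL,
    TypeId  INT NOT NULL,
    `Value` VARCHAR(255),                      -- stored as text, cast on read
    PRIMARY KEY (EntryId, TypeId),
    FOREIGN KEY (EntryId) REFERENCES DataPointEntry (Id),
    FOREIGN KEY (TypeId)  REFERENCES DataPointTypes (Id)
);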
Suppose I have two tables:
Group
(
id integer primary key,
someData1 text,
someData2 text
)
GroupMember
(
id integer primary key,
group_id foreign key to Group.id,
someData text
)
I'm aware that my SQL syntax is not correct :) Hopefully it's clear enough. My problem is this: I want to load a group record and all the GroupMember records associated with that group. As I see it, there are two options.
A single query:
SELECT Group.id, Group.someData1, Group.someData2, GroupMember.id, GroupMember.someData
FROM Group INNER JOIN GroupMember ...
WHERE Group.id = 4;
Two queries:
SELECT id, someData1, someData2
FROM Group
WHERE id = 4;
SELECT id, someData
FROM GroupMember
WHERE group_id = 4;
The first solution has the advantage of being only one database round trip, but has the disadvantage of returning redundant data (all the group data is duplicated for every group member).
The second solution returns no duplicate data but involves two round trips to the database.
What is preferable here? I suppose there's some threshold such that if the group sizes become sufficiently large, the cost of returning all the redundant data is going to be greater than the overhead involved with an additional database call. What other things should I be thinking about here?
Thanks,
Jordan
If you actually want the results joined, I believe it is always more efficient to do the joining at the server level. The SQL processor is designed to match sets of data.
If you really want the results of 2 sql statements, you can always send two statements in one batch separated by a semicolon, and get two resultsets back with one round trip to the DB.
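For example, using the tables from the question (backtick quoting as in MySQL/MariaDB, since GROUP is a reserved word; whether this arrives as one round trip depends on the client library, e.g. MySQL connectors need multi-statement support enabled):

-- Two statements sent as one batch; each returns its own result set.
SELECT id, someData1, someData2 FROM `Group` WHERE id = 4;
SELECT id, someData FROM GroupMember WHERE group_id = 4;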
How the data is finally used is an important and unknown factor.
I suggest the single query method for most applications. Proper indexing will keep the query more efficient than the two query method.
The single query method also has the benefit of remaining valid if you need to select more than one group.
If you are only ever going to be retrieving a single group record with each request to the database, then I would go with the second option. If you are retrieving multiple group records and their associated group member records, go with the join, as it will be much quicker.
In general, it depends on what type of data you are trying to display.
If you are showing a single group and all its members, performance differences between the two options would be negligible.
If you are showing many groups and all of their members, the overhead of having to make a roundtrip to the database for each successive group will quickly outweigh any benefit you got from receiving a little less data.
Some other things you might want to consider in your reasoning:
Result Set Size - For many groups and members, your result set size may become a limiting factor as the amount of data to retrieve and keep in memory increases. This is likely to occur with the second option. You may want to consider paging the data, so that you are only retrieving a certain subset at a time (a paging sketch follows after this list).
Lazy Loading - If you are only getting the members of some groups, or a user is requesting the members one group at a time, consider Lazy Loading. This means only making the additional query to get the group's members when needed. This makes sense only in certain use cases, but it can be much more effective than retrieving all data up front.
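As a minimal paging sketch against the GroupMember table from the question (LIMIT/OFFSET syntax as in MySQL/MariaDB; the page size of 50 is arbitrary):

-- Fetch one page of members for group 4; increase OFFSET (or, better, use a
-- keyset condition on id) for subsequent pages.
SELECT id, someData
FROM GroupMember
WHERE group_id = 4
ORDER BY id
LIMIT 50 OFFSET 0;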
Depending on the type of database and your frontend application, you can return the results of two SQL statements on one trip (A stored procedure in SQL Server 2005 for example).
If you are creating a report that requires many fields from the Group table, you may not want the increased amount of data with the first query.
If this is some type of data entry app, you've probably already presented the Group data to the user, so they could fill in the group id on the where clause (or preferably via some parameter) and now they need the member results.
It really, really, really depends on what use you will make of the data.
For instance, if you were assembling a list of group members for a mail shot, and you need the group name for each letter you're going to send to a member, and you have no use for the Group level, then the single joined query makes a lot of sense.
But if, say, you're coding a master-detail screen or report, with a page for each group and displaying information at both the Group and the Member levels then the two separate queries is probably most useful.
Unless you are retrieving quite large amounts of data (tens of thousands of groups with hundreds of members per group, or similar orders of magnitude), it is unlikely you are going to see much difference in performance between the two approaches.
On a simple query like this I would try to perform it in one query. The overhead of two database calls will probably exceed the additional SQL processing time of the single query.
A UNION clause will do this for you:
SELECT id, someData1, someData2
FROM Group
WHERE id = 4
UNION
SELECT id, someData, null
FROM GroupMember
WHERE group_id = 4;
In a DB I'm designing, there's one fairly central table representing something that's been sold or is for sale. It distinguishes between personal sales (like eBay) and sales from a proper company. This means there are literally one or two fields which are not equally appropriate to both cases... for instance, one field is only used in one case, another field is optional in one case but mandatory in the other.
If there were more specialized fields, it would be sensible to have a core table and then two tables with the fields relevant to the specific cases. But here, creating two tables just to contain one field plus the reference to the core table seems both aesthetically bad and painful for the query designer and the DB software.
What do you think? Is it ok to bend the rules slightly by having a single table with weakened constraints - meaning the DB cannot 100% prevent inconsistent data being added (in a very limited way) - or do I suck it up and create dumb-looking 1-field tables?
What you're describing with one table for common columns and dependent tables for subtype-specific columns is called Class Table Inheritance. It's a perfectly good thing to do.
What @Scott Ferguson seems to be describing (two distinct tables for the two types of sales) is called Concrete Table Inheritance. It can also be a good solution depending on your needs, but more often it just makes it harder to write queries across both subtypes.
If all you need is one or two columns that apply only to a given subtype, I agree it seems like overkill to create dependent tables. Remember that most brands of SQL database support CHECK constraints or triggers, so you can design data integrity rules into the metadata.
CREATE TABLE Sales (
sale_id SERIAL,
is_business INT NOT NULL, -- 1 for corporate, 0 for personal
sku VARCHAR(20), -- only for corporate
paypal_id VARCHAR(20), -- mandatory but only for personal
CONSTRAINT chk_personal_paypal CHECK (is_business = 1 OR paypal_id IS NOT NULL) -- personal sales must supply a paypal_id
);
I think keeping these fields in one table is not going to hurt you today and would be the choice I would go for. Just remember that as your database evolves, you may need to make the decision to refactor into two separate tables (if you need more fields).
There are some who insist that inapplicable fields should never be allowed, but I think this is one of those rules that someone wrote in a book and now we're all supposed to follow without questioning why. In the case you're describing, a single table sounds like the simple, intelligent solution.
I would certainly not create two tables. Then all the common fields would be duplicated, and all your queries would have to join or union two tables. So the real question is, One table or three. But you seem to realize that.
You didn't clarify what the additional fields are. If the presence or absence of one field implies the record type, then I sometimes use that fact as the record-type indicator rather than creating a redundant type field. For example, if the only difference between a "personal sale" and a "business sale" is that a business sale has a foreign key to a company filled in, then you can simply define a business sale as one with the company filled in, and no ambiguity is possible. But if the situation gets even slightly more complicated, this can be a trap: I've seen applications that say "if a is null and b = c then it's record type A, else if b is null and..." and so on. If you can't do it with one test on one field, forget it and put in a record type field.
You can always enforce consistency with code or constraints.
I worry a lot more about redundant data creating consistency problems than about inapplicable fields. Redundant data creates all sorts of problems. Data inapplicable to a record type? In the worst case, just ignore it. If it's a "personal sale" and somehow a company got filled in, ignore it or null it out on sight. Problem solved.
If there are two distinct entities, "Personal Sales" and "Company Sales", then perhaps you ought to have two tables to represent those entities?
News flash: the DB cannot prevent 100% of corrupt data no matter which way you cut it. So far you have only considered what I call level 1 corruption (level 0 corruption is essentially what would happen if you wrote garbage over your database with a hex editor).
I have yet to see a database that could prevent level 2 corruption (syntactically correct records but when taken as a whole mean something perverse).
The PRO of keeping all fields in one table is that you get rid of JOINs, which makes your queries faster.
The CONTRA is that your table grows larger, which makes your queries slower.
Which one impacts you more, totally depends on your data distribution and which queries you issue most often.
In general, splitting is better for OLTP systems, joining is better for data analysis (that tends to scan the tables).
Let's imagine 2 scenarios:
Split fields. There are 1,000,000 rows, the average row size is 20 bytes, and the split field is filled once per 50 rows (i.e., 20,000 records in the split table).
We want to query like this:
SELECT SUM(mainfield + COALESCE(splitfield, 0))
FROM maintable
LEFT JOIN
splittable
ON splitid = mainid
This will require scanning 20,000,000 bytes, plus nested loops (or hash lookups) to find the 20,000 matching records.
Each hash lookup is roughly equivalent to scanning 10 rows, so the total time will be equivalent to scanning 20,000,000 + 10 * 20,000 * 20 = 24,000,000 bytes.
Joined fields (everything in one table). There are 1,000,000 rows and the average row size is 24 bytes, so the query will scan 24,000,000 bytes.
As you can see, the times tie.
However, if any parameter changes (the field is filled more or less often, the row size grows or shrinks, etc.), one or the other solution will become better.